Recovering Casing and Punctuation using Conditional Random Fields
Abstract
This paper describes the winning entry to the ALTA Shared Task 2013. The theme of the shared task was the recovery of casing and punctuation information from degraded English text. We tackle the task as a sequential labeling problem, jointly learning the casing and punctuation labels. We implement our sequential classifier using conditional random fields, trained on linguistic features extracted with off-the-shelf tools that were given simple adaptations to the specific task. We show the improvement due to adding each feature we consider, as well as the improvement due to utilizing additional training data beyond that supplied by the shared task organizers.
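To make the idea of jointly learning casing and punctuation labels concrete, the following minimal sketch derives one combined label per token from ordinary English text, which is the kind of training pair a sequential classifier could consume. The label scheme (`LOWER`/`CAP`/`UPPER` joined by `+` to the following punctuation mark, or `O` for none) and the function name `joint_labels` are assumptions made for illustration; the abstract does not specify the label set, features, or preprocessing the authors used.

```python
import string


def joint_labels(text):
    """Turn cased, punctuated text into (degraded_token, joint_label) pairs.

    Hypothetical label scheme (not taken from the paper):
    "<CASE>+<PUNCT>", where CASE is LOWER, CAP or UPPER and PUNCT is the
    first punctuation mark following the word, or "O" if there is none.
    """
    pairs = []
    for raw in text.split():
        # Separate the word from any punctuation attached to its end.
        word = raw.rstrip(string.punctuation)
        if not word:
            continue  # token consisted only of punctuation
        trailing = raw[len(word):]
        punct = trailing[0] if trailing else "O"

        # Classify the casing pattern of the original word.
        if word.isupper() and len(word) > 1:
            case = "UPPER"
        elif word[0].isupper():
            case = "CAP"
        else:
            case = "LOWER"

        # The degraded input is the lowercased word without punctuation.
        pairs.append((word.lower(), f"{case}+{punct}"))
    return pairs


if __name__ == "__main__":
    for token, label in joint_labels("Hello, world. NASA rocks!"):
        print(token, label)
```

Because each token carries a single combined label, one linear-chain model can predict both properties at once. The sketch handles only trailing punctuation and whitespace tokenization; a real system would need a proper tokenizer.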
Related resources
A CRF Sequence Labeling Approach to Chinese Punctuation Prediction
This paper presents a conditional random fields based labeling approach to Chinese punctuation prediction. To this end, we first reformulate Chinese punctuation prediction as a multiple-pass labeling task on a sequence of words, and then explore various features from three linguistic levels, namely words, phrase and functional chunks for punctuation prediction under the framework of conditional...
Conditional Random Fields for Automatic Punctuation
We model the relationship between sentences and their punctuation labels using conditional random fields. Some feature functions are hand-designed and others are generated by templates. We train the same model by stochastic gradient ascent, Collins Perceptron and contrastive divergence respectively and compare their performance. On the provided dataset, we achieve word-level accuracy of 94.56%....
Punctuation Prediction using Linear Chain Conditional Random Fields
We investigate the task of punctuation prediction in English sentences without prosodic information. In our approach, stochastic gradient ascent (SGA) is used to maximize log conditional likelihood when learning the parameters of linear-chain conditional random fields. For SGA, two different approximation techniques, namely Collins perceptron and contrastive divergence, are used to estimate the...
Better Punctuation Prediction with Dynamic Conditional Random Fields
This paper focuses on the task of inserting punctuation symbols into transcribed conversational speech texts, without relying on prosodic cues. We investigate limitations associated with previous methods, and propose a novel approach based on dynamic conditional random fields. Different from previous work, our proposed approach is designed to jointly perform both sentence boundary and sentence ...
Dynamic Conditional Random Fields for Joint Sentence Boundary and Punctuation Prediction
The use of dynamic conditional random fields (DCRF) has been shown to outperform linear-chain conditional random fields (LCRF) for punctuation prediction on conversational speech texts [1]. In this paper, we combine lexical, prosodic, and modified n-gram score features into the DCRF framework for a joint sentence boundary and punctuation prediction task on TDT3 English broadcast news. We show t...